[RL] pause: use abort pipeline with scheduling loop alive for req drained #7753
jackyYang6 wants to merge 1 commit into develop
Conversation
Thanks for your contribution! |
CI report generated from the code below (refreshed every 30 minutes):
1. Task overview: 1 required task failed
2. Task status summary
   2.1 Required tasks: 7/10 passed
   2.2 Optional tasks: 28/30 passed
3. Failure details (required only): Approval — code style (confidence: high): Approval
   Root-cause details: Key logs: Suggested fix:
Fix-suggestion summary: please have xyxinyang or zyyzghb review and approve. Related change: the PR adds … to the pause logic
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@           Coverage Diff            @@
##           develop    #7753   +/-  ##
==========================================
  Coverage         ?   63.17%
==========================================
  Files            ?      461
  Lines            ?    64121
  Branches         ?     9821
==========================================
  Hits             ?    40506
  Misses           ?    20840
  Partials         ?     2775
```
Flags with carried forward coverage won't be shown.
Force-pushed 8fec3b3 to 7d634d5
Force-pushed 7d634d5 to 8ea07fe
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-13 15:51:59
📋 Review Summary
PR overview: rewrites _control_pause() as a two-phase pause: first gracefully abort all requests through the abort pipeline (returning partial results), then pause the scheduling loop, resolving the abort+pause deadlock and the loss of partial results in RL scenarios.
Scope of change: engine/common_engine.py, engine/sched/resource_manager_v1.py, entrypoints/, router/, and related tests
Impact tags: [Engine] [APIServer] [Scheduler]
📝 PR Convention Check
Description structure is compliant (all 5 required sections are present and filled in).
The title tag [RL] is a valid official tag, but per the impact-mapping table in architecture.md, [RL] maps to fastdeploy/rl/, while this PR's changes are concentrated in fastdeploy/engine/ and fastdeploy/entrypoints/. [Engine] is recommended (the author listed both [RL] and [Engine] in the PR checklist, and [Engine] describes the change scope more accurately).
Suggested title (copy-paste ready):
[Engine] pause: use abort pipeline with scheduling loop alive for graceful pause in RL scenarios
Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/engine/common_engine.py:1326 | When req_id is None, the handler warns but does not `continue`, so None is written into waiting_abort_req_id_set |
| 🟡 Suggestion | fastdeploy/engine/common_engine.py:1504 | _wait_inflight_drained() has no timeout protection; if a worker fails, the control thread blocks forever with no way to recover |
| ❓ Question | fastdeploy/engine/common_engine.py | The execution flow in the PR description lists handle scheduler stragglers, _wait_output_queue_empty(), and scheduler.reset(), none of which appear in the code; the description and the implementation diverge |

Overall assessment
The two-phase pause design is clear and addresses the deadlock and partial-result loss at the root, and the reused abort-pipeline path has been functionally verified. One P0 bug remains (the missing `continue` on a null req_id lets an invalid abort enter the pipeline), and the lack of a timeout is a latent operational risk worth attention.
```python
"Receive abort request without request_id, skip invalid abort message"
)
self.llm_logger.info(f"Receive abort request, req_id: {req_id}")
self.resource_manager.add_abort_req_ids(req_id)
```
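A minimal sketch of the fix the 🔴 item asks for: skip the message with `continue` instead of letting `None` reach `add_abort_req_ids`. The handler and stub classes below are hypothetical stand-ins for the engine's real objects, shaped after the snippet above:

```python
class _Logger:
    """Stub for the engine's llm_logger."""
    def __init__(self):
        self.lines = []
    def warning(self, msg):
        self.lines.append(("warn", msg))
    def info(self, msg):
        self.lines.append(("info", msg))

class _ResourceManager:
    """Stub exposing the abort set named in the review."""
    def __init__(self):
        self.waiting_abort_req_id_set = set()
    def add_abort_req_ids(self, req_id):
        self.waiting_abort_req_id_set.add(req_id)

class AbortHandler:
    """Hypothetical stand-in for the engine's abort-message handler."""
    def __init__(self):
        self.llm_logger = _Logger()
        self.resource_manager = _ResourceManager()

    def handle_abort_messages(self, messages):
        for msg in messages:
            req_id = msg.get("request_id")
            if req_id is None:
                # The reviewed bug: warning without `continue` let None
                # fall through into waiting_abort_req_id_set.
                self.llm_logger.warning(
                    "Receive abort request without request_id, skip invalid abort message"
                )
                continue
            self.llm_logger.info(f"Receive abort request, req_id: {req_id}")
            self.resource_manager.add_abort_req_ids(req_id)
```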
Force-pushed 8ea07fe to ba85acb
…l termination

Replace the old preempted_all + error_response approach in _control_pause with a two-phase design:

Phase 1: Block new requests via _rejecting_new_requests (NOT is_paused)
- Scheduling loop keeps running so _trigger_abort can process
- add_abort_req_ids(ALL) marks all requests for abort
- Scheduling loop catches them via _trigger_abort as they cycle through

Phase 2: After drain, set is_paused=True to fully stop scheduling loop
- Handle scheduler-only stragglers with direct _send_error_response
- Wait for output queue empty, then reset

Depends-on: PaddlePaddle#7615 (refact abort_requests to fire-and-forget)
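The two phases above can be sketched as a toy state model (a simplified illustration under assumed names, not the actual FastDeploy implementation; the real abort pipeline is asynchronous and returns partial results, which this sketch only stubs out):

```python
import threading

class TwoPhasePauseModel:
    """Toy model of the two-phase pause described in the commit message.

    Phase 1 sets `rejecting_new_requests` so new work is refused while the
    scheduling loop stays alive to drive the abort pipeline; phase 2 sets
    `is_paused` only after in-flight requests have drained.
    """
    def __init__(self):
        self.rejecting_new_requests = False
        self.is_paused = False
        self.inflight = set()
        self._lock = threading.Lock()

    def add_request(self, req_id):
        # Gate mirrors the described check on both flags.
        if self.is_paused or self.rejecting_new_requests:
            return False  # rejected, but the scheduling loop keeps running
        with self._lock:
            self.inflight.add(req_id)
        return True

    def abort_all(self):
        # Stand-in for add_abort_req_ids(ALL); the real pipeline returns
        # partial results per request, here we just drain the set.
        with self._lock:
            self.inflight.clear()

    def control_pause(self):
        self.rejecting_new_requests = True   # phase 1: reject, don't stop
        self.abort_all()                     # abort pipeline runs to completion
        assert not self.inflight             # drained
        self.is_paused = True                # phase 2: stop scheduling loop
```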
Force-pushed ba85acb to 3ef80aa
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-13 16:20:55
📋 Review Summary
PR overview: rewrites the _control_pause() implementation to fix the abort-pipeline deadlock and the dropping of partial results in RL scenarios.
Scope of change: fastdeploy/engine/common_engine.py, tests/engine/test_common_engine.py
Impact tags: [Engine] [RL]
📝 PR Convention Check
The PR description structure is complete (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist all present); pre-commit is unchecked in the checklist, please run it before merging.
Issues

| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/engine/common_engine.py:1491 | _wait_inflight_drained() has no timeout; it blocks forever if the abort pipeline stalls |
| ❓ Question | fastdeploy/engine/common_engine.py:1463 | The PR execution flow describes scheduler.reset(), but it is missing from the code; scheduler.responses may hold residual entries |
| ❓ Question | fastdeploy/engine/common_engine.py | The PR Modifications table declares a new method _wait_output_queue_empty(), but the diff contains no such implementation |
| ❓ Question | fastdeploy/engine/common_engine.py | token_processor.clear_data() was removed without a comment explaining whether the abort pipeline now covers its cleanup duties |

Overall assessment
The overall design is clear: the two-phase separation of reject and pause effectively resolves the deadlock, and passing accuracy tests verifies functional correctness. However, the PR description and the implementation diverge in several places (scheduler.reset() and _wait_output_queue_empty() are described but absent from the code); the author should either explain the omissions or complete the implementation. A fallback safeguard is also recommended for the no-timeout design of _wait_inflight_drained().
```python
No timeout — abort pipeline will complete. Aligned with SGLang's poll-until-drained.
"""
start_time = time.time()
while (
```
🟡 Suggestion: _wait_inflight_drained() has no timeout mechanism and may block forever
The old code waited on the worker queue with a 60s timeout and raised an Exception; the new design drops timeout protection entirely. The docstring says "No timeout — abort pipeline will complete", but if the abort pipeline stalls due to a bug or an anomaly (e.g. a worker hang or a lost ZMQ message), _control_pause() blocks forever, the upstream RL framework caller cannot tell, and the engine hangs silently.
Suggested fallback timeout:
```python
DRAIN_TIMEOUT = 120
start_time = time.time()
while (self.resource_manager.requests or self.scheduler.requests
        or self.resource_manager.waiting_abort_req_id_set
        or self.resource_manager.to_be_aborted_req_id_set):
    if time.time() - start_time > DRAIN_TIMEOUT:
        self.llm_logger.error(f"Drain timed out after {DRAIN_TIMEOUT}s, abort pipeline may have stalled!")
        raise TimeoutError(f"_wait_inflight_drained timed out after {DRAIN_TIMEOUT}s")
    time.sleep(0.005)
```
```python
self._send_error_response(req.request_id, "Request is aborted since engine is paused.")
self.scheduler.reset()

if envs.ENABLE_V1_KVCACHE_MANAGER:
```
❓ Question: the PR execution flow describes scheduler.reset(), but the code here lacks it
The end of the execution flow in the PR description explicitly lists scheduler.reset() + cache reset, but only the cache reset is present; self.scheduler.reset() is never called.
Looking at the local_scheduler.reset() implementation (line 115-119), it clears:
- ids_read_cursor (reset to 0)
- ids (the list of all historical request IDs)
- requests (the dict of pending requests)
- responses (the dict of received responses)
_wait_inflight_drained() only checks that requests and the abort queues are empty; it does not check responses. If scheduler.responses holds residual unconsumed data, resume may produce inconsistent state.
Please confirm: is omitting scheduler.reset() intentional (the abort pipeline's normal path already guarantees responses is fully consumed), or is it a missing piece of the implementation?
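To make the question concrete, here is a hedged sketch of a drain check extended to also cover `scheduler.responses` (attribute names follow the review text; whether `responses` must gate the drain is exactly the open question, so the extension is kept behind a flag):

```python
import time

def wait_inflight_drained(engine, poll_interval=0.005, also_check_responses=True):
    """Poll until all in-flight engine state is empty.

    `also_check_responses` is a hypothetical extension: the PR's version
    checks only `requests` and the two abort sets, so residual entries in
    `scheduler.responses` could survive into resume.
    """
    while True:
        pending = (
            engine.resource_manager.requests
            or engine.scheduler.requests
            or engine.resource_manager.waiting_abort_req_id_set
            or engine.resource_manager.to_be_aborted_req_id_set
        )
        if also_check_responses:
            # Extra condition under discussion in the review.
            pending = pending or engine.scheduler.responses
        if not pending:
            return
        time.sleep(poll_interval)
```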
Depends-on: #7615 (refact abort_requests to fire-and-forget)
Motivation
In RL scenarios, the upstream framework calls `abort_request` followed by `pause` to stop the engine. The old `_control_pause` implementation had two critical issues:
1. Lost partial results: `preempted_all()` + `_send_error_response(500)` discarded already-inferred tokens, returning an error instead of partial results to clients.
2. Deadlock with the abort pipeline: setting `is_paused=True` at the start blocked the scheduling loop (`_pause_cond.wait_for`), which prevented `_trigger_abort` from processing abort requests, causing a 30s timeout deadlock.

The new design separates "reject new requests" (`_rejecting_new_requests`) from "pause scheduling loop" (`is_paused`), allowing the abort pipeline to complete naturally before the engine state reset. This ensures partial inference results are returned to clients via `token_processor._put_abort_results` (200 "Aborted") through the normal output path.
Modifications
fastdeploy/engine/common_engine.py:
- Add `self._rejecting_new_requests = False` in `__init__` to decouple request rejection from scheduling-loop pause
- Gate incoming requests with `if self.is_paused or self._rejecting_new_requests:`
- Rewrite `_control_pause()`
- Add `_wait_inflight_drained()`, which polls until `resource_manager.requests` is empty
Execution flow
Usage or Command
Accuracy Tests
Checklist
- Tags: [RL], [Engine]
- Run `pre-commit` before commit.
- Tests added (`test_control_pause_and_resume_paths`)
- If targeting the `release` branch, make sure the PR has been submitted to the `develop` branch.